Gate MoE fusion passes to XeLP+ and align pipeline comments by Copilot · Pull Request #5 · peterchen-intel/openvino

Copilot · 2026-06-02T05:57:44Z

Details:

Problem summary: The MoE fusion block in transformations_pipeline.cpp needed to execute conditionally by GPU architecture, and nearby comments no longer matched the gated behavior.
Behavior change (MoE fusion gating): Restrict MoeOpFusion, FuseMOESharedExpert, and FuseMOE3GemmCompressed registration to xe_lp and newer.
Documentation-in-code update: Clarified comments to distinguish always-on MoE conversion passes from XeLP+-only MoE fusion passes.

Snippet:

if (!disable_moe_opt && device_info.arch >= cldnn::gpu_arch::xe_lp) {
    const bool has_batch_dim = !is_pa;
    manager.register_pass<ov::pass::MoeOpFusion>(has_batch_dim);
    manager.register_pass<ov::intel_gpu::FuseMOESharedExpert>();
    manager.register_pass<ov::intel_gpu::FuseMOE3GemmCompressed>();
}
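
The guard above relies on the architecture enum being ordered, so that `>=` means "this generation or newer". A minimal, self-contained sketch of that selection logic follows. The enum `gpu_arch_sketch` and the function `select_moe_fusion_passes` are hypothetical stand-ins for illustration only; the real `cldnn::gpu_arch` members and their ordering may differ.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for cldnn::gpu_arch. Only the relative ordering
// matters for this sketch; the real enum may list different members.
enum class gpu_arch_sketch { unknown = 0, gen9 = 1, xe_lp = 2, xe_hpg = 3, xe2 = 4 };

// Returns the names of the MoE fusion passes that would be registered.
// The pass names come from the PR snippet; the function itself is illustrative.
std::vector<std::string> select_moe_fusion_passes(gpu_arch_sketch arch, bool disable_moe_opt) {
    std::vector<std::string> passes;
    // Scoped enums of the same type compare by underlying value,
    // so this reads as "xe_lp or any later architecture".
    if (!disable_moe_opt && arch >= gpu_arch_sketch::xe_lp) {
        passes = {"MoeOpFusion", "FuseMOESharedExpert", "FuseMOE3GemmCompressed"};
    }
    return passes;
}
```

In this sketch an `unknown` architecture sorts below `xe_lp`, so the fusion passes are skipped for it. Whether that matches the intended behavior for undetected devices is worth checking against the real enum.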

Tickets:

N/A

AI Assistance:

AI assistance used: yes
AI was used to implement the conditional guard and update surrounding comments; human validation was performed via code review of the affected section and consistency checks against nearby transformation-pipeline logic.

causing it to be skipped on MTL-class iGPU (12.70.x, XeHPG, no DPAS). This left raw FP32 weight-decompression chains that overwhelmed propagate_constants with ~56 GB of constant-folding memory.

Root cause of inference failure: moe_3gemm_swiglu_opt uses oneDNN internally (onednn_linear for gate/up/down matrix multiplications). oneDNN requires an in-order OCL queue. MTL uses an out-of-order queue by default because use_onednn is false when supports_immad=false.

Fix:
- Three MoE transformation passes (FuseVectorizedMOE3GEMM, ConvertMOEToMOECompressed, FuseMOE3GemmCompressed) run on all architectures. FuseMOE3GemmCompressed creates MOE3GemmFusedCompressed, which the OCL moe_3gemm_swiglu_opt kernel executes.
- Detect MOE3GemmFusedCompressed in apply_model_specific_options and force use_onednn=true so finalize_impl sets queue_type=in_order, satisfying the oneDNN in-order queue requirement.
- Fix moe_gather validate_impl to accept rank-2 input for models where the batch dimension is pre-flattened (Qwen3-style).
- Re-apply iGPU transfer skip (usm_shared -> usm_device) in network.cpp and program.cpp for integrated GPUs where both allocation types share system DRAM (xe2+ or 12.7x-class MTL/ARL-S).

Tested on machine (GPU uArch 12.70.4 / XeHPG / System memory 64 GB): model loads in 14 s, generates meaningful tokens, Unevictable stays below 120 MB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Chen Peter <peter.chen@intel.com>

Instead of setting use_onednn=true when MOE3GemmFusedCompressed is detected, set m_queue_type=in_order directly. This is more precise: the only requirement is an in-order OCL command queue (for onednn_linear in moe_3gemm_swiglu_opt.cpp), not full oneDNN enablement for the whole model. Leaving use_onednn=false on non-systolic hardware (MTL, 12.70.x) ensures that oneDNN implementations for FC, convolution, GEMM etc. are not activated on hardware without DPAS units.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The MTL-class (12.70.x) iGPU has a separate GPU L3 cache from the CPU, so copying usm_shared -> usm_device does improve GPU access performance. Reverts the MTL condition added in the prior fix commit, keeping only the original xe2+ integrated GPU skip (which has true unified memory).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

In the correct fix, GEMM3_SWIGLU (Qwen3) always goes through FuseMOE3GemmCompressed -> MOE3GemmFusedCompressed, which creates a single fused primitive with no standalone moe_gather node. The rank-2 accept was only needed during an intermediate broken debug state where FuseMOE3GemmCompressed was wrongly blocked. moe_gather is only used by GEMM2_BIAS_SWIGLU_CLAMP models, whose input is rank-3.

Restore original: input_pshapes.rank() != 3 || input_pshapes[2].is_dynamic()

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The previous fix set m_queue_type = in_order directly in apply_model_specific_options, but this left m_use_onednn = false. On non-systolic hardware (supports_immad=false), program.cpp only calls lo.enable_onednn_for<lstm_seq/gru_seq>() (making the onednn_impls_optimization_attribute non-empty, which triggers create_onednn_engine() in select_preferred_formats.cpp) when use_onednn=true. With use_onednn=false, the engine is never initialized, causing moe_3gemm_fused_compressed to crash at inference time with 'oneDNN engine not initialized'.

Fix: set m_use_onednn = true (not queue_type) when MOE3GemmFusedCompressed is detected. finalize_impl then sets queue_type = in_order because use_onednn=true, and the create_onednn_engine() call is correctly triggered.

This is safe on non-systolic hardware: FuseVectorizedFC (systolic FC) is gated independently on supports_immad, so no systolic ops are introduced by enabling use_onednn for the MoE path.

Verified: all 3 prompts pass with correct output on MTL iGPU (GPU_UARCH_VERSION=12.70.4, supports_immad=false).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
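
The dependency chain described in this commit message (fused MoE op detected, then use_onednn set, then an in-order queue and an initialized oneDNN engine) can be sketched as below. This is a simplified model under stated assumptions, not the plugin's code: `ExecConfigSketch`, `make_default_config`, `finalize` and `onednn_engine_created` are hypothetical names, and the real logic is spread across apply_model_specific_options, finalize_impl, program.cpp and select_preferred_formats.cpp.

```cpp
#include <string>
#include <vector>

enum class QueueTypeSketch { out_of_order, in_order };

// Hypothetical, simplified stand-in for the plugin's execution config.
struct ExecConfigSketch {
    bool supports_immad = false;  // true on systolic (DPAS) hardware
    bool use_onednn = false;
    QueueTypeSketch queue_type = QueueTypeSketch::out_of_order;
};

// Per the commit text, use_onednn defaults to false when supports_immad is false.
ExecConfigSketch make_default_config(bool supports_immad) {
    ExecConfigSketch cfg;
    cfg.supports_immad = supports_immad;
    cfg.use_onednn = supports_immad;
    return cfg;
}

// Models the detection step: a MOE3GemmFusedCompressed op forces use_onednn.
void apply_model_specific_options(ExecConfigSketch& cfg, const std::vector<std::string>& op_types) {
    for (const auto& type : op_types) {
        if (type == "MOE3GemmFusedCompressed") {
            cfg.use_onednn = true;
            break;
        }
    }
}

// Models finalize_impl: the queue type is derived from use_onednn, not set directly.
void finalize(ExecConfigSketch& cfg) {
    if (cfg.use_onednn) {
        cfg.queue_type = QueueTypeSketch::in_order;
    }
}

// Models the engine-creation condition: without use_onednn the engine is never
// initialized, which is the crash the commit describes.
bool onednn_engine_created(const ExecConfigSketch& cfg) {
    return cfg.use_onednn;
}
```

Setting the queue type directly would satisfy only the first of the two requirements. Deriving it from use_onednn keeps the queue type and engine initialization tied to the same flag.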

paged_attention_opt__multi_tokens allocates a tmp_out scratch buffer sized total_tokens * heads_num * v_head_size * num_of_partitions * sizeof(float). For Qwen3-30B with chunk_size=4096 and 8K KV context this is 2 GB per layer. With 48 layers all executing sequentially, this totalled 96 GB of demand-paged USM device allocation. On Intel iGPU (ARLS, i915 driver), the driver pins the entire allocation as Unevictable on first GPU access regardless of pages touched, causing CL_OUT_OF_RESOURCES on a 31 GB machine.

Root cause: can_share_internal_buffer(false) in paged_attention_node unconditionally blocked the memory pool for ALL internal buffers. This was added in PR openvinotoolkit#33204 to prevent CPU/GPU races on lockable buffers (blocks_indexes_start/end, blocked_gws_subseq_mapping) written by prepare_internal_buffers(). However, it also blocked pool reuse for non-lockable GPU-only buffers (exp_sums, max_logits, tmp_out), which are safe to share across sequential layers.

Fix:
- Remove can_share_internal_buffer(false) from paged_attention_node; per-buffer lockability is already tracked via BufferDescriptor::m_lockable, so CPU-written (lockable=true, usm_host) buffers remain non-shareable while GPU-only (lockable=false, usm_device) buffers can be reused from the pool.
- In allocate_internal_buffers(): pass buffer_descs[i].m_lockable to the call (previously dropped, causing wrong alloc type on initial allocation).

Result: 48 layers share one 2 GB tmp_out buffer instead of allocating 48 separate 2 GB buffers. Peak Unevictable drops from OOM crash (~28+ GB) to ~18.9 GB on ARLS (Intel Arc 8086:7d67, Arrow Lake-S iGPU, 31 GB).

Verified: Qwen3-30B-A3B-Instruct-2507-int4-ov with chunk_size=4096, 8K prompt, ContinuousBatching on ARLS completes successfully with exit code 0 and 20 coherent output tokens.

Not affected on ARLH (supports_immad=true takes micro_sdpa path which does not allocate tmp_out at all).

Signed-off-by: Chen Peter <peter.chen@intel.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
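
The arithmetic behind the 2 GB and 96 GB figures, and the effect of per-buffer pool sharing, can be checked with a short sketch. The size formula is the one quoted in the commit message. The concrete inputs (32 heads, head size 128, 32 partitions, i.e. an 8K context split into partitions of 256 tokens) are assumptions chosen for illustration that happen to reproduce the quoted 2 GB; they are not stated in the source. `BufferDescSketch` and `peak_bytes` are hypothetical names, not the plugin's BufferDescriptor API.

```cpp
#include <cstdint>
#include <vector>

// Size formula quoted in the commit message for the tmp_out scratch buffer.
constexpr std::uint64_t tmp_out_bytes(std::uint64_t total_tokens,
                                      std::uint64_t heads_num,
                                      std::uint64_t v_head_size,
                                      std::uint64_t num_of_partitions) {
    return total_tokens * heads_num * v_head_size * num_of_partitions * sizeof(float);
}

// Hypothetical stand-in for a per-buffer descriptor with a lockability flag.
struct BufferDescSketch {
    std::uint64_t bytes;
    bool lockable;  // true: CPU-written, must not be shared between layers
};

// Policy described in the fix: only non-lockable (GPU-only) buffers may be reused.
constexpr bool can_share_from_pool(const BufferDescSketch& desc) {
    return !desc.lockable;
}

// Peak allocation for `layers` sequential layers that each request the same
// set of buffers: shareable buffers are counted once, others once per layer.
inline std::uint64_t peak_bytes(const std::vector<BufferDescSketch>& per_layer,
                                std::uint64_t layers) {
    std::uint64_t total = 0;
    for (const auto& desc : per_layer) {
        total += can_share_from_pool(desc) ? desc.bytes : desc.bytes * layers;
    }
    return total;
}

// Assumed inputs (illustrative): chunk of 4096 tokens, 32 heads, head size 128,
// 32 partitions. With a 4-byte float this gives 2^31 bytes = 2 GiB.
static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");
static_assert(tmp_out_bytes(4096, 32, 128, 32) == (std::uint64_t{1} << 31),
              "2 GiB per layer under the assumed inputs");
```

Under these assumptions, treating tmp_out as non-shareable across 48 layers gives 48 × 2 GiB = 96 GiB, while treating it as shareable gives a single 2 GiB allocation, which matches the before and after described in the commit message.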

Run the compressed MoE fusion chain unconditionally in TransformationsPipeline::apply() instead of gating it on supports_immad. This keeps the non-systolic iGPU path aligned with the systolic path for Qwen3-style compressed MoE models.

What changes
- always register ConvertTiledMoeBlockToGatherMatmuls
- always register ConvertGatherMatmulToGatherMatmulCompressed
- keep MoeOpFusion/FuseMOESharedExpert/FuseMOE3GemmCompressed behind disable_moe_opt only, not supports_immad
- trim temporary debug instrumentation and leave only the functional change
- simplify nearby comments to describe the required behavior concisely

Why
On non-systolic Intel iGPU (supports_immad=0), gating the compressed MoE preparation passes on supports_immad breaks the fusion chain before MOECompressed/MOE3GemmFusedCompressed can be formed. As a result, Qwen3-30B-A3B OV FP16-4BIT long-context runs show incorrect execution behavior and memory issues on the unfused path. This change preserves the intended compressed MoE graph rewrite chain on all GPU architectures while leaving the actual backend kernel selection to the lower-level capability checks.

Validation
- ARLHx iGPU, non-systolic
- rebuilt openvino_intel_gpu_plugin successfully
- deployed rebuilt plugin into the OV package runtime directory
- short prompt sanity test passed: generate done
- 32K prompt regression with max_num_batched_tokens=1024 passed on the deployed plugin: generate done, max_unevictable_gb=23.063, min_memavail_gb=36.323, max_vms_gb=34.855, max_rss_gb=1.176, no threshold kill observed

Signed-off-by: Chen Peter <peter.chen@intel.com>

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

peterchen-intel and others added 17 commits April 13, 2026 06:54

Merge branch 'master' into oom/fixing

8a030fd

Merge branch 'master' into oom/fixing

2004557

Merge branch 'master' into oom/fixing

c9eb5c6

Merge branch 'master' into oom/fixing

046c575

Merge branch 'master' into oom/fixing

3ebb016

Roll back the mistaken change

990c0bc

Merge branch 'master' into oom/fixing

39ba7fc

Merge branch 'master' into oom/fixing

1284765

Support oneDNN known gpu_arch only

f8d0cc3

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Guard MoE fusion passes by XeLP arch

630c999

Copilot AI assigned Copilot and peterchen-intel Jun 2, 2026

Copilot created this pull request from a session on behalf of peterchen-intel June 2, 2026 05:57 View session

Copilot started work on behalf of peterchen-intel June 2, 2026 07:41 View session

Clarify comments for XeLP-gated MoE fusion

9dcc9b0

Copilot AI changed the title ~~Gate MoE fusion registrations to XeLP+ GPUs~~ Gate MoE fusion passes to XeLP+ and align pipeline comments Jun 2, 2026

Copilot AI requested a review from peterchen-intel June 2, 2026 07:43

Copilot finished work on behalf of peterchen-intel June 2, 2026 07:43

peterchen-intel force-pushed the oom/fixing branch 2 times, most recently from 3a593df to 1722fcf Compare June 13, 2026 09:06

Copilot wants to merge 18 commits into oom/fixing from copilot/add-condition-for-code-snippet